1 Introduction

After years of debate and controversy, China relaxed its one-child policy in 2013 and replaced it with a universal two-child policy in 2016. The concern was that the Chinese population was aging rapidly and that the one-child policy was not helping. But do Chinese people really want to have more children under the new two-child policy? Some contend that having multiple children has always been a tradition in Chinese families, and that growing up without siblings is a man-made anomaly caused by government policy. On this view, growing up in a big family will likely encourage people to have multiple children themselves. Others argue that the rising cost of living and education will deter young couples from doing so, because they cannot afford to support their aging parents and raise multiple children at the same time. A further question is whether rural and urban areas in China differ in this respect.

This project looks for potential relationships between family structure and the number of children in China. Does family size follow “path dependency”? If so, couples who come from large families should have more children. Or do large families create economic burdens that prevent couples from having multiple children? I investigate those questions in this project.

Hypothesis 1: Growing up with more siblings increases the willingness to have more kids.

Hypothesis 2: High costs deter parents from having more kids. Higher income should increase parents’ willingness to raise more kids.

Hypothesis 3: There are differences between different types of communities, including big cities, small towns, and rural areas.

Dependent variable: total child

Independent variables: total siblings, family income, community_type

2 The Data Set

The East Asian Social Survey (EASS) is a biennial social survey project that serves as a cross-national network of four General Social Survey-type surveys in East Asia: the Chinese General Social Survey (CGSS), the Japanese General Social Survey (JGSS), the Korean General Social Survey (KGSS), and the Taiwan Social Change Survey (TSCS). It comparatively examines diverse aspects of social life in these regions. Survey information in this module focuses on family dynamics and includes demographic variables such as the number of family members, the number of younger and older siblings, the number of sons and daughters, and whether family members are alive or deceased. Respondents were also asked for specific information about family members and children not co-residing with them, such as sex and birth order, age, marital status, residence status, contact frequency, employment status, and relation to the respondent. Other information collected includes attitudes toward financial support from family members and how frequently financial and personal support was provided. Questions also cover opinions regarding household chores, lifestyle preferences, health of respondent and parents, and family obligations. Quality-of-life questions addressed how satisfied respondents were as well as overall marital happiness. Demographic information specific to the respondent and their spouse includes age, sex, marital status, education, employment status and hours worked, occupation, earnings and income, religion, class, size of community, and region.

Dataset: East Asian Social Survey (EASS), Cross-National Survey Data Sets: Families in East Asia, 2006 (ICPSR 34606)

Link: https://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/34606

#Load packages, then load and rename data
library(tidyverse)
library(plotly)
library(stargazer)
load("34606-0001-Data.rda")
EASS <- da34606.0001

For the purpose of this project, we are only concerned with observations from China.

#Create a subset with only observations from China
CHINA <- EASS %>%
  filter(V2 == "(1) CN-China") 

2.1 Data Structure

The filtered dataset includes 332 variables and 3208 observations in total. The variables include the following aspects of information:

Number of family members

Number of R’s brothers and sisters, Number of Spouse’s brothers and sisters

Number of sons and daughters

Family members: Relation to respondent, Sex, Age, Marital Status, Co-residence Status, Employment Status

Sons and daughters who do not currently live with you: Sex & birth order, Age, Marital status, Employment status, How far live, How often face-to-face contact, How often any other contact

R’s parents and Spouse’s parents: Alive or not, Age, Marital status, Employment Status, Living arrangement, How far live, How often face-to-face contact, How often any other contact

Other demographic information including education, income etc.

2.2 Variables of Interest

Several variables are relevant to our research question:

V10: Total number of siblings of respondent.

V16: Total number of siblings of spouse.

V17: Total number of sons.

V18: Total number of daughters.

CN_HINC: Household income: China

CN_SIZE: Size of town/city: China

2.3 Cleaning Data

#Creating variable total_child
CHINA <- CHINA %>%
  mutate(total_child = V17 + V18)

#Creating variable total_siblings (from both of the parents)
CHINA <- CHINA %>%
  mutate(total_siblings = V10 + V16)

#Creating variable community_type

CHINA <- CHINA %>%
  mutate(CN_SIZE_numeric = as.numeric(as.factor(CN_SIZE))) 

CHINA$community_type <- factor(NA,levels = c("Rural","Small City","Medium City","Large City"))
CHINA$community_type[CHINA$CN_SIZE_numeric == 7] <- "Rural"
CHINA$community_type[CHINA$CN_SIZE_numeric == 6 | CHINA$CN_SIZE_numeric == 5] <- "Small City"
CHINA$community_type[CHINA$CN_SIZE_numeric == 4 | CHINA$CN_SIZE_numeric == 3] <- "Medium City"
CHINA$community_type[CHINA$CN_SIZE_numeric == 2 | CHINA$CN_SIZE_numeric == 1] <- "Large City"

Dependent variable: total child

Independent variables: total siblings, family income, community_type

Here I define the community_type variable in the following way:

Cities with a population < 500,000 are small cities; cities with a population from 500,000 to 2,000,000 are medium cities; cities with a population > 2,000,000 are large cities.
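The recoding above can also be written as a lookup, which keeps the code-to-label correspondence in one place. This is a minimal sketch on toy data: `size_code` is a hypothetical stand-in for `CN_SIZE_numeric`, and the mapping (1-2 large, 3-4 medium, 5-6 small, 7 rural) is copied from the assignments in the cleaning step.

```r
# Toy stand-in for CN_SIZE_numeric (hypothetical values, not survey data)
size_code <- c(1, 2, 3, 4, 5, 6, 7, NA)

# Lookup vector: position i holds the label for code i
# (mapping copied from the recoding step in this section)
size_labels <- c("Large City", "Large City",
                 "Medium City", "Medium City",
                 "Small City", "Small City",
                 "Rural")

community_type <- factor(size_labels[size_code],
                         levels = c("Rural", "Small City",
                                    "Medium City", "Large City"))

# Missing codes stay NA instead of being assigned to a level
table(community_type, useNA = "ifany")
```

Indexing by code means an unexpected or missing code surfaces as `NA` in the result, which is easy to spot in the frequency table.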

3 Summary and Visualization

3.1 Descriptive Statistics

3.1.1 Structure of Interesting Variables

CHINA %>%
  select(total_child, total_siblings, CN_HINC, community_type) %>%
  str()
## 'data.frame':    3208 obs. of  4 variables:
##  $ total_child   : num  NA 2 2 1 NA 1 NA NA 1 2 ...
##   ..- attr(*, "value.labels")= Named num 0
##   .. ..- attr(*, "names")= chr "No sons"
##  $ total_siblings: num  NA 6 7 5 NA 11 NA NA 5 11 ...
##   ..- attr(*, "value.labels")= Named num 
##   .. ..- attr(*, "names")= chr 
##  $ CN_HINC       : num  NA 24000 24000 120000 NA 24000 82000 0 50000 7000 ...
##   ..- attr(*, "value.labels")= Named num 
##   .. ..- attr(*, "names")= chr 
##  $ community_type: Factor w/ 4 levels "Rural","Small City",..: 4 4 4 4 4 4 4 4 4 4 ...

3.1.2 Summary of Dependent Variable

CHINA %>%
  select(total_child) %>%
  stargazer(type = "html")
Statistic    N      Mean   St. Dev.  Min    Pctl(25)  Pctl(75)  Max
total_child  2,632  1.787  1.008     1.000  1.000     2.000     9.000

3.1.3 Summary of Independent variables

#Quantitative Variables
CHINA %>%
  select(total_siblings, CN_HINC, community_type) %>%
  stargazer(type = "html",
            flip = TRUE)
Statistic  total_siblings  CN_HINC
N          2,706           2,891
Mean       8.210           23,315.050
St. Dev.   3.136           34,871.860
Min        2.000           0.000
Pctl(25)   6.000           8,000.000
Pctl(75)   10.000          30,000.000
Max        22.000          1,000,000.000
#Categorical Variable
summary(CHINA$community_type)
##       Rural  Small City Medium City  Large City 
##        1012         572         673         951

From the summary, we see that the dataset includes more than 500 respondents from each community type. Each group is therefore large enough to estimate its mean with reasonable precision, although sample size alone does not guarantee that a group is representative of its population.

3.2 Visualizations

3.2.1 Community Types

CHINA %>%
  group_by(community_type)%>%
  summarise(mean_child = mean(total_child, na.rm = T)) %>%
  plot_ly(x = ~community_type, y = ~mean_child, type = 'bar',
          marker = list(color = 'gold',
                        line = list(color = 'navy',
                                    width = 1.5))
          ) %>%
  layout(title = "Does Community Type Correlate with Number of Children?",
         xaxis = list(title = "Community Types"),
         yaxis = list(title ="Average Number of Children in Each Family"))

This graph shows the differences among families in different community types. Rural families clearly tend to have more children: the average number of children in a rural family is more than 2, while the average is lower than 2 for city families. The average decreases as we move to larger cities, so the more urban the community, the fewer children parents are likely to have. This is evidence in favor of Hypothesis 3. If it holds, we should also control for community type as we proceed to investigate other variables.

3.2.2 Parents’ Siblings

CHINA%>%
  ggplot(aes(x=total_siblings,y=total_child,color=community_type))+
  geom_jitter(width = 2.5)+
  geom_smooth(method = "lm") +
  labs(title = "Does Growing Up with Siblings Increase Willingness to Have More Children?",
       x = "Total Siblings of Parents",
       y = "Number of Children",
       color = "Community Type") +
  theme_minimal()

From this graph, we observe a positive relationship between the parents’ number of siblings and the number of children they have. This relationship holds across all community types. We can also see that rural families have more children overall, while families from large cities have the fewest. In every community type, parents with more siblings tend to have more children. This is evidence in support of Hypothesis 1.

3.2.3 Family Income

CHINA%>%
  ggplot(aes(x=log(CN_HINC),y=total_child,color=community_type))+
  geom_jitter(width = 2.5)+
  geom_smooth(method = "lm") +
  labs(title = "Does Higher Income Increase Willingness to Have More Children?",
       x = "Log(Family Income)",
       y = "Number of Children",
       color = "Community Type") +
  theme_minimal()

From this graph, we do not observe a positive relationship between income and the number of children. In fact, we see a potentially negative relationship. This is evidence against Hypothesis 2. Rural families still tend to have the most children. Note that I put family income on a logarithmic scale to better present the data distribution.

4 Regression Analysis

4.1 Cleaning Variables

4.1.1 Creating Dummy Variables for Community Type

Because community type is a categorical variable, I’ll create separate dummy variables so that the regression analysis can include different reference groups. The variable value is 1 if the family belongs to that community.

#Creating dummy variables
CHINA$rural <- 0
CHINA$small_city <- 0
CHINA$medium_city <- 0
CHINA$large_city <- 0
CHINA$rural[CHINA$community_type == "Rural"] <- 1
CHINA$small_city[CHINA$community_type == "Small City"] <- 1
CHINA$medium_city[CHINA$community_type == "Medium City"] <- 1
CHINA$large_city[CHINA$community_type == "Large City"] <- 1
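
As a cross-check on hand-built dummies, base R can generate the same 0/1 columns from a factor. The sketch below uses a small toy factor rather than the survey data; the column names are chosen here to mirror the dummies created in this section.

```r
# Toy factor standing in for CHINA$community_type (hypothetical values)
community_type <- factor(
  c("Rural", "Small City", "Medium City", "Large City", "Rural"),
  levels = c("Rural", "Small City", "Medium City", "Large City")
)

# One 0/1 column per level; "- 1" drops the intercept so no level is omitted
dummies <- model.matrix(~ community_type - 1)
colnames(dummies) <- c("rural", "small_city", "medium_city", "large_city")

# Each observation belongs to exactly one community type,
# so every row should sum to 1
dummies
rowSums(dummies)
```

A row sum other than 1 would indicate an observation that was assigned to no community type or to more than one.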

4.1.2 Taking Natural Log on the Income Variable

I take the natural log of the income variable because the distribution of income is highly right-skewed. As the previous visualizations show, income has a more linear relationship with our outcome variable when put on a log scale.

CHINA<-CHINA %>%
  mutate(CN_HINC_LOG = log(CN_HINC))

#Replacing non-finite values with NA (log(0) is -Inf for zero incomes)
CHINA$CN_HINC_LOG[!is.finite(CHINA$CN_HINC_LOG)] <- NA

4.2 Multiple Linear Regression Models

I build four models that predict the number of children in each family from the parents’ siblings, log family income, and community type. Each model includes one community dummy, so its coefficient compares that community type with the other three combined. Together, the models show how each community type differs from the rest and whether the coefficients on siblings and income are stable when a different community dummy is included.

\[ \text{Model 1: } TotalChildren = \beta_0 + \beta_1 \cdot Siblings + \beta_2 \cdot \log(Income) + \beta_3 \cdot Rural + u \]
\[ \text{Model 2: } TotalChildren = \beta_0 + \beta_1 \cdot Siblings + \beta_2 \cdot \log(Income) + \beta_3 \cdot SmallCity + u \]
\[ \text{Model 3: } TotalChildren = \beta_0 + \beta_1 \cdot Siblings + \beta_2 \cdot \log(Income) + \beta_3 \cdot MediumCity + u \]
\[ \text{Model 4: } TotalChildren = \beta_0 + \beta_1 \cdot Siblings + \beta_2 \cdot \log(Income) + \beta_3 \cdot LargeCity + u \]

lm_rural<-lm(total_child~total_siblings+CN_HINC_LOG+rural,data = CHINA)

lm_mediumcity<-lm(total_child~total_siblings+CN_HINC_LOG+medium_city,data = CHINA)

lm_largecity<-lm(total_child~total_siblings+CN_HINC_LOG+large_city,data = CHINA)

lm_smallcity<-lm(total_child~total_siblings+CN_HINC_LOG+small_city,data = CHINA)
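
Because each model above contains a single dummy, its coefficient compares that community type with the other three pooled together. An alternative specification, sketched here on simulated data rather than the survey, enters community type as one factor so that every level is compared with an explicit reference group. The data frame `toy`, its column names, and every number used to simulate it are illustrative assumptions, not estimates from this project.

```r
set.seed(1)
n <- 400
lvls <- c("Rural", "Small City", "Medium City", "Large City")

# Simulated regressors (hypothetical, not survey data)
toy <- data.frame(
  total_siblings = rpois(n, 8),
  log_income     = rnorm(n, mean = 10, sd = 1),
  community_type = factor(sample(lvls, n, replace = TRUE), levels = lvls)
)

# Simulated outcome: rural households get an arbitrary +0.5 shift
toy$total_child <- 1 + 0.06 * toy$total_siblings - 0.1 * toy$log_income +
  0.5 * (toy$community_type == "Rural") + rnorm(n, sd = 0.9)

# "Rural" is the first factor level, so lm() uses it as the reference group
fit <- lm(total_child ~ total_siblings + log_income + community_type,
          data = toy)

# Three community coefficients, each measured relative to Rural
coef(fit)
```

With this layout, the small, medium, and large city coefficients are each a difference from rural families, which can be easier to read than four separate one-against-the-rest comparisons.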

4.3 Regression Results

Now I present the four models in one table:

stargazer(lm_rural,lm_smallcity,lm_mediumcity,lm_largecity,
          type = "html",align=TRUE,
          font.size= "large",
          no.space=TRUE,
          dep.var.labels="<b>Total Number of Children</b>",
          covariate.labels=c("<b>Parents' Siblings</b>", "<b>Log(Family Income)</b>", "<b>Rural</b>", "<b>Small City</b>","<b>Medium City</b>","<b>Large City</b>")
          )
                                 Dependent variable:
                                 Total Number of Children
                                 (1)         (2)         (3)         (4)
Parents’ Siblings                0.057***    0.065***    0.066***    0.059***
                                 (0.006)     (0.007)     (0.006)     (0.006)
Log(Family Income)               -0.124***   -0.232***   -0.225***   -0.178***
                                 (0.023)     (0.021)     (0.022)     (0.023)
Rural                            0.480***
                                 (0.045)
Small City                                   -0.057
                                             (0.050)
Medium City                                              -0.138***
                                                         (0.049)
Large City                                                           -0.363***
                                                                     (0.050)
Constant                         2.331***    3.480***    3.424***    3.095***
                                 (0.242)     (0.221)     (0.222)     (0.225)
Observations                     2,389       2,389       2,389       2,389
R2                               0.140       0.100       0.103       0.119
Adjusted R2                      0.139       0.099       0.102       0.118
Residual Std. Error (df = 2385)  0.944       0.966       0.964       0.955
F Statistic (df = 3; 2385)       129.466***  88.612***   91.033***   107.720***
Note: *p<0.1; **p<0.05; ***p<0.01

From the regressions, we can draw the following conclusions:

  1. Rural families have significantly more children than urban families. Model 1 tells us that being a rural family is associated with nearly 0.5 more expected children, holding the other variables constant. This finding is evidence in support of Hypothesis 3.

  2. The more siblings the parents grew up with, the more children they are likely to have as adults. The coefficient is positive and statistically significant in all four models, so the result does not depend on which community dummy is included. This finding is evidence in support of Hypothesis 1.

  3. Family income has a negative relationship with the number of children, and the sign is the same in all four models. The coefficient is larger in magnitude in Models 2 and 3 (small and medium city dummies) and smaller in Models 1 and 4 (rural and large city dummies). Because income enters in logs, these coefficients measure the change in the expected number of children per one-unit increase in log income, not per unit of currency. Regardless, this result is evidence against Hypothesis 2.
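
For a specification in which the outcome is in levels and income is in logs, the income coefficient can be translated into proportional income changes. The worked numbers below use the Model 1 estimate of -0.124 from the regression table; the symbols \(\widehat{Y}\) (expected number of children) and \(c\) (the proportional change in income) are introduced here only for illustration.

\[ \Delta \widehat{Y} = \hat{\beta}_2 \left[ \ln\big((1+c) \cdot Income\big) - \ln(Income) \right] = \hat{\beta}_2 \ln(1+c) \]
\[ c = 0.01: \quad -0.124 \times \ln(1.01) \approx -0.124 \times 0.00995 \approx -0.0012 \]
\[ c = 1: \quad -0.124 \times \ln(2) \approx -0.124 \times 0.693 \approx -0.086 \]

Under this reading, a 1% higher income corresponds to roughly 0.0012 fewer expected children in Model 1, and a doubling of income to roughly 0.086 fewer, holding siblings and the rural dummy fixed.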

5 Additional Inference

5.1 F-Test

An F-test was performed for each of the four models (see the regression table above). Each F statistic is larger than 80 with 3 and 2,385 degrees of freedom, which corresponds to a p-value smaller than 0.01. Therefore, we can reject the null hypothesis that the coefficients on all regressors are jointly zero. We conclude that the regressors jointly help explain the number of children.
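
The p-values behind these F statistics can be recovered from the upper tail of the F distribution. The short sketch below plugs in the four F statistics and the degrees of freedom reported in the regression table; the vector names are chosen here for illustration.

```r
# F statistics copied from the regression table (Models 1 to 4)
f_stats <- c(model1 = 129.466, model2 = 88.612,
             model3 = 91.033,  model4 = 107.720)

# Upper-tail probability of an F(3, 2385) distribution at each statistic
p_values <- pf(f_stats, df1 = 3, df2 = 2385, lower.tail = FALSE)

p_values
all(p_values < 0.01)
```

For a fitted model, the same statistic and degrees of freedom are stored in `summary(fit)$fstatistic`.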

5.2 Confidence Intervals

5.2.1 Model 1

confint(lm_rural,level = 0.99)
##                      0.5 %      99.5 %
## (Intercept)     1.70743197  2.95512072
## total_siblings  0.04035132  0.07337517
## CN_HINC_LOG    -0.18438449 -0.06382792
## rural           0.36316167  0.59756028

5.2.2 Model 2

confint(lm_smallcity,level = 0.99)
##                      0.5 %      99.5 %
## (Intercept)     2.90914267  4.05019906
## total_siblings  0.04846583  0.08198352
## CN_HINC_LOG    -0.28768657 -0.17698936
## small_city     -0.18649876  0.07190424

5.2.3 Model 3

confint(lm_mediumcity,level = 0.99)
##                      0.5 %      99.5 %
## (Intercept)     2.85183942  3.99544510
## total_siblings  0.04921059  0.08269047
## CN_HINC_LOG    -0.28102204 -0.16958200
## medium_city    -0.26594305 -0.01102685

5.2.4 Model 4

confint(lm_largecity,level = 0.99)
##                     0.5 %      99.5 %
## (Intercept)     2.5148307  3.67561000
## total_siblings  0.0417539  0.07526359
## CN_HINC_LOG    -0.2363262 -0.12013591
## large_city     -0.4911128 -0.23409916

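Conclusion: With one exception, the 99% confidence intervals for the regressors in the four models do not include 0, indicating that those coefficients are significantly different from 0 at the 1% level. The exception is small_city in Model 2, whose interval (-0.186, 0.072) includes 0.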
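
Whether an interval excludes 0 can be checked mechanically instead of by eye. The sketch below copies the 99% bounds for the four community dummies from the `confint()` output above; the data frame `ci` and the column `contains_zero` are names introduced here for illustration.

```r
# 99% intervals for the community dummies, copied from the confint() output
ci <- data.frame(
  term  = c("rural", "small_city", "medium_city", "large_city"),
  lower = c(0.36316167, -0.18649876, -0.26594305, -0.4911128),
  upper = c(0.59756028,  0.07190424, -0.01102685, -0.23409916)
)

# An interval contains zero when its bounds do not share a sign
ci$contains_zero <- ci$lower <= 0 & ci$upper >= 0
ci
```

The same check applies to any matrix returned by `confint()`, using its first column as the lower bound and its second as the upper bound.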

5.3 Additional T-Test

From the table, we can see a clear difference between rural communities and large cities. However, comparing Model 2 and Model 3, it is unclear whether small cities differ from medium cities. Therefore, I run a two-sample t-test to investigate whether cities with population from 500,000 to 2,000,000 are different from cities with population < 500,000 in terms of having more or fewer children.

#First, I filter the data so it only includes observations from small cities and medium cities.
CHINA_filtered<- CHINA %>%
  filter(community_type == "Small City" | community_type == "Medium City")

#Performing t-test
t.test(CHINA_filtered$total_child~CHINA_filtered$community_type,conf.level = 0.99,var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  CHINA_filtered$total_child by CHINA_filtered$community_type
## t = 0.68513, df = 1008.3, p-value = 0.4934
## alternative hypothesis: true difference in means is not equal to 0
## 99 percent confidence interval:
##  -0.1138297  0.1961141
## sample estimates:
##  mean in group Small City mean in group Medium City 
##                  1.713427                  1.672285

The two-sample t-test gives a p-value of 0.49, much larger than 0.05 and than the 0.01 level implied by the 99% confidence interval. Therefore, we fail to reject the null hypothesis that the mean number of children is the same in the two groups (difference = 0). We conclude that there is no significant difference between small and medium cities in the average number of children.
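
To make explicit what the Welch test computes, the sketch below rebuilds the statistic and its degrees of freedom by hand and compares them with `t.test()`. The two samples are simulated Poisson counts, not the survey data, and the group sizes and means are arbitrary assumptions.

```r
set.seed(3)
# Simulated counts standing in for the two city groups (not survey data)
small  <- rpois(120, 1.7)
medium <- rpois(150, 1.7)

# Per-group variance of the sample mean
v1 <- var(small) / length(small)
v2 <- var(medium) / length(medium)

# Welch statistic: difference in means over the unpooled standard error
t_manual <- (mean(small) - mean(medium)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (v1 + v2)^2 /
  (v1^2 / (length(small) - 1) + v2^2 / (length(medium) - 1))

# t.test() with var.equal = FALSE should reproduce both quantities
res <- t.test(small, medium, var.equal = FALSE)
c(t_manual = t_manual, t_builtin = unname(res$statistic))
c(df_manual = df_manual, df_builtin = unname(res$parameter))
```

The non-integer degrees of freedom in the output above (df = 1008.3) come from this approximation, which does not assume equal variances in the two groups.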

6 Conclusion

In this project, I investigated the relationships between Chinese family structures and the willingness to have more kids. I visualized the relationships, constructed multiple regression models, and performed statistical inference. Three hypotheses were proposed and studied. From the results, I conclude that for Chinese parents:

  1. With the other variables controlled, growing up with more siblings is associated with significantly more kids. This finding holds in all four models.

  2. Higher family income does not encourage parents to have more kids. In fact, there is a potentially negative relationship between income and the number of children, even when controlling for rural/urban differences. This might be related to norms and education. More studies are needed to reach further conclusions.

  3. There is a large rural/urban distinction. Rural families tend to have significantly more children than urban families, and this result is consistent when controlling for the other variables. The contrast is sharpest between rural communities and large cities, while there is little difference between small and medium cities. This is an interesting finding: the number of children differs markedly between rural and urban families, but differs little between somewhat smaller and somewhat larger cities.